DEV-1381 use delta computation for dailies with hathifiles-database & report on those changes #22
- Change `hathifiles_database_full_update` to use `DeltaUpdate` for monthly and update hathifiles
- Add `statistics` method for evaluating actual work done for a given hathifile
- Excise `ENV["HATHIFILES_MYSQL_CONNECTION"]` from `Hathifiles` initializer
- Add idempotency and daily "no deletions" test to delta update spec
- Change all `HATHIFILES_MYSQL_*` env vars to `MARIADB_HATHIFILES_RW_*`
I have a reservation about allowing database credentials from ENV to be overridden by the `DB::Connection`. Maybe just add a note to the README that relying on ENV is the preferred way to go.
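To make the override concern concrete, here is a minimal sketch of keyword arguments that default to ENV. It is not the gem's actual `DB::Connection` code: the method name `connection_settings` and the variable suffixes (`USERNAME`, `PASSWORD`, `HOST`, `DATABASE`) are assumptions for illustration; only the `MARIADB_HATHIFILES_RW_` prefix comes from this PR.

```ruby
# Hypothetical sketch, not the gem's real initializer.
# Explicit keyword arguments win; anything not passed falls back to ENV.
def connection_settings(env: ENV, **overrides)
  prefix = "MARIADB_HATHIFILES_RW_" # prefix from the PR; suffixes below are assumed
  %i[username password host database].to_h do |key|
    env_name = prefix + key.to_s.upcase
    value = overrides.fetch(key) { env[env_name] }
    raise ArgumentError, "set #{env_name} or pass #{key}:" if value.nil?
    [key, value]
  end
end

# A plain hash stands in for ENV so the example does not touch real credentials.
fake_env = {
  "MARIADB_HATHIFILES_RW_USERNAME" => "ht_user",
  "MARIADB_HATHIFILES_RW_PASSWORD" => "secret",
  "MARIADB_HATHIFILES_RW_HOST" => "mariadb",
  "MARIADB_HATHIFILES_RW_DATABASE" => "hathifiles"
}
settings = connection_settings(env: fake_env, host: "localhost")
```

In this sketch an explicit `host:` silently replaces the ENV value, which is the behavior the reservation is about; a README note recommending ENV would not change the code path, only the guidance.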
Haven't yet reviewed; based on the description, I think that before deploying (even in testing) we need to get https://github.com/hathitrust/ht_tanka/tree/database-secrets merged.
aelkiss
left a comment
The database env var / connection changes make sense to me. I need to spend some more time looking at the delta computation.
aelkiss
left a comment
This all looks great to me; tests make sense; I appreciate all the comments/documentation in the delta_update class.
As previously mentioned we will need to make sure that the hathifiles pod has sufficient working space before doing this and also add in the new secrets/env vars (https://github.com/hathitrust/ht_tanka/pull/129).
This should be useful for holdings in terms of looking at possible new/changed items and updating its mapping from ocn to clusters as needed; thinking about the best way to get that data to holdings is future work.
Reminders:
FYI @niquerio This is a significant change to how hathifiles-database works on a daily basis. No rush to update to this version right away when merged, but you may want to take a look, especially after we test this out on our end and make sure it all works.
- … `MonthlyUpdate` class to `DeltaUpdate`
- … `hathifiles_database_full_update` to use `DeltaUpdate` for monthly and update hathifiles
- … `statistics` method for evaluating actual work done for a given hathifile
- … `tempdir` instead of one for all of the files to be processed (may mitigate excessive disk usage on the first of month).
- … `comm` tricks we can get a count of records added vs updated, at the expense of additional time and storage. Fortunately, these additional operations won't happen if you don't call `delta_update_obj.statistics`.
- `HATHIFILES_MYSQL_CONNECTION` replaced with `kwargs` defaulting to `ENV`
- `HATHIFILES_MYSQL_*` vars migrated to `MARIADB_HATHIFILES_RW_*`
- `exe/` files removed (`hathifiles_database_full_update` is the only survivor).
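As a rough illustration of the delta idea and the added-vs-updated count, here is a minimal in-memory sketch. It is not the `DeltaUpdate` implementation: in-memory set differences stand in for the `comm`-based file comparison mentioned above, the method names are hypothetical, and it assumes tab-separated lines whose first field is the item identifier.

```ruby
require "set"

# Assumed layout: tab-separated line, identifier in the first field.
def htid(line)
  line.split("\t").first
end

# Lines only in the new file are additions (what `comm -13` reports on sorted input);
# lines only in the old file are deletions (what `comm -23` reports).
def delta(old_lines, new_lines)
  old_set = old_lines.to_set
  new_set = new_lines.to_set
  { additions: new_set - old_set, deletions: old_set - new_set }
end

# Extra pass to split the delta into added, updated, and removed records.
# This is the optional, more expensive step; skip it if no report is needed.
def statistics(old_lines, new_lines)
  d = delta(old_lines, new_lines)
  old_ids = old_lines.map { |l| htid(l) }.to_set
  new_ids = new_lines.map { |l| htid(l) }.to_set
  added, updated = d[:additions].partition { |l| !old_ids.include?(htid(l)) }
  removed = d[:deletions].reject { |l| new_ids.include?(htid(l)) }
  { added: added.size, updated: updated.size, removed: removed.size }
end

old_file = ["a.1\tfoo", "a.2\tbar", "a.3\tbaz"]
new_file = ["a.1\tfoo", "a.2\tBAR", "a.4\tqux"]
stats = statistics(old_file, new_file)
rerun = delta(new_file, new_file) # applying the same file twice yields an empty delta
```

Here `a.4` counts as added, `a.2` as updated (same identifier, changed line), and `a.3` as removed; the empty `rerun` delta mirrors the idempotency check added to the spec.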